
feat(mtagro): enable multi-rank usage #4

Open

HaoZeke wants to merge 19 commits into metatomic from realDomDec

Conversation

@HaoZeke
Member

@HaoZeke HaoZeke commented Jan 30, 2026

Look away until

Or "real domain decomposition"..

Basically the LAMMPS style, where the model is loaded everywhere and computes on every rank.

WIP. Needs testing to ensure consistency.

All set. Closes #7.

@HaoZeke HaoZeke changed the base branch from metatomic to noVesin January 30, 2026 11:44
@HaoZeke HaoZeke marked this pull request as ready for review January 30, 2026 11:44
@HaoZeke HaoZeke marked this pull request as draft January 30, 2026 11:49
@HaoZeke HaoZeke mentioned this pull request Feb 2, 2026
Base automatically changed from noVesin to metatomic February 3, 2026 14:40
@HaoZeke HaoZeke force-pushed the realDomDec branch 5 times, most recently from 5c0cd25 to d3505c6 Compare February 4, 2026 17:18
@HaoZeke HaoZeke marked this pull request as ready for review February 15, 2026 07:36
@HaoZeke
Member Author

HaoZeke commented Feb 15, 2026

This is now consistent with `mpirun -n 1 gmx_mpi mdrun` all the way up to `-n 12` on the example from https://github.com/HaoZeke/pixi_envs/tree/main/orgs/metatensor/gromacs/mta_test

(Diagnostics use `fprintf`, since `GMX_LOG` is rank-0 only and `cerr` is not allowed per the GROMACS style guides.)

1. The bonded interaction building (`make_bondeds_zone`) runs for all zones but won't find any cross-zone bondeds when `!hasInterAtomicInteractions()`; it's a no-op for the extra zones.
2. The exclusion building (`make_exclusions_zone`) correctly builds exclusion entries for all i-zone atoms, satisfying the pairlist assertion.

Added `nzone_bondeds = std::max(nzone_bondeds, numIZonesForExclusions)` to ensure the exclusion-building loop covers all i-zones when an `intermolecularExclusionGroup` is present.

Without this, 3D DD (e.g., 2x2x2 with 8 ranks) has `numIZones = 4` but `nzone_bondeds = 1`, so exclusion lists are only built for zone 0 atoms while the nbnxm assertion expects them for zones 0-3.
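The zone-count fix above can be sketched as a small helper; the function and parameter names here are illustrative stand-ins, not the actual GROMACS symbols:

```cpp
#include <algorithm>

// Sketch of the fix described above: when an intermolecular exclusion
// group is present, the exclusion-building loop must cover every i-zone,
// not just the zones that carry bonded interactions.
int zonesToBuild(int nzone_bondeds, int numIZonesForExclusions, bool hasIntermolecularExclusions)
{
    if (hasIntermolecularExclusions)
    {
        nzone_bondeds = std::max(nzone_bondeds, numIZonesForExclusions);
    }
    return nzone_bondeds;
}
```

With 3D DD as in the example above (2x2x2, 8 ranks), `numIZonesForExclusions` is 4 while `nzone_bondeds` starts at 1, so the helper returns 4 and exclusions get built for zones 0-3.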
1. `localToModelIndex_` sized to `numLocalPlusHalo` instead of `signal.x_.size()` (was an out-of-bounds write)
2. `augmentGhostPairs` rewritten to correctly identify halo MTA atoms by iterating `localToModelIndex_` from `numLocalAtoms_` onward, instead of incorrectly slicing the full coordinate array
3. Shift vector computed as `pair.dx() - (positions_[B] - positions_[A])` in model space, then rounded to integer cell shifts and recomputed from box vectors for consistency
4. Deduplication of pairs using `std::set<tuple>` to handle overlap between signal pairs and augmented halo-halo pairs
5. Timer instrumentation via a `MetatomicTimer` RAII class around key phases
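Item 4 (pair deduplication) can be sketched as follows; `PairKey` and `deduplicate` are hypothetical names, assuming each neighbour pair is keyed by its two atom indices plus a cell-shift index:

```cpp
#include <set>
#include <tuple>
#include <vector>

// Hypothetical sketch of item 4 above: signal pairs and augmented
// halo-halo pairs may overlap, so an ordered std::set keyed on
// (i, j, shift) keeps each pair exactly once.
using PairKey = std::tuple<int, int, int>; // (atom i, atom j, cell-shift index)

std::vector<PairKey> deduplicate(const std::vector<PairKey>& signalPairs,
                                 const std::vector<PairKey>& haloPairs)
{
    std::set<PairKey> seen(signalPairs.begin(), signalPairs.end());
    seen.insert(haloPairs.begin(), haloPairs.end());
    return std::vector<PairKey>(seen.begin(), seen.end());
}
```

A `std::set` of tuples also gives a deterministic pair ordering across runs, which helps when checking consistency between rank counts.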
Comment on lines +242 to +244
// TODO: For multi-layer GNN models (MACE, NequIP, etc.) the
// interaction_range should be n_layers * cutoff so that DD halos
// are deep enough for message-passing. Many models currently
Contributor

I think there is nothing we can do here. The model has to declare the correct interaction_range property:

https://docs.metatensor.org/metatomic/latest/torch/reference/models/metadata.html#metatomic.torch.ModelCapabilities

We also say that this has to include the message passing layers.

Member Author

Makes sense; I mean, we have the `max_cutoff` stuff.

Comment on lines +168 to +171
// With thread-MPI, each rank is a thread sharing the same process.
// PyTorch's internal OpenMP would spawn N threads per rank, causing
// massive oversubscription (e.g. 12 ranks × 12 OMP threads = 144
// threads on 12 cores). Force single-threaded torch operations.
Contributor

I thought we still oversubscribe even with this check?

Member Author

Sure, but that's on the user; without it, PyTorch separately tries to use the OMP variable.
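The oversubscription arithmetic from the quoted comment can be made concrete; the function below is only an illustration with an assumed name (the plugin itself, per the comment, simply forces single-threaded torch under thread-MPI):

```cpp
// Illustration of the oversubscription arithmetic from the comment above:
// with thread-MPI every rank is a thread in one process, so letting each
// rank's PyTorch spawn its own OpenMP pool yields nRanks * ompThreads
// threads in total.
bool wouldOversubscribe(int nRanks, int ompThreadsPerRank, int hardwareThreads)
{
    return nRanks * ompThreadsPerRank > hardwareThreads;
}
```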

if (fp)
{
data_->dtype = torch::kFloat32;
std::fprintf(fp,
Contributor

Shouldn't this use the GROMACS log mechanism instead of a plain `fprintf`?

Member Author

So the GROMACS logger only works on the main rank, and the docs suggest not using iostreams:

> Use STL, but do not use iostreams outside of the unit tests. iostreams can have a negative impact on performance compared to other forms of string streams, depending on the use case. Also, they don't always play well with using C stdio routines at the same time, which are used extensively in the current code. However, since Google tests rely on iostreams, you should use it in the unit test code.

So `fprintf` seemed like the best choice; it's also used all over the code.

Contributor

I like the timer a lot. However, you said there is a GROMACS-internal timer; the question is whether we have to use that to get the code upstream. But maybe we think about this once we start with that step.

Member Author

Yeah, currently it is env-var gated, but maybe they'd prefer us to use theirs.
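For reference, an env-var-gated RAII phase timer in the spirit of the `MetatomicTimer` mentioned above might look like this; the class name, environment variable, and output format are assumptions for illustration, not the plugin's actual code:

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <utility>

// Sketch of an environment-variable-gated RAII timer; it prints via
// fprintf rather than iostreams, per the GROMACS style guide quoted
// earlier. The MTA_TIMING variable name is an assumption.
class ScopedPhaseTimer
{
public:
    explicit ScopedPhaseTimer(std::string phase) :
        phase_(std::move(phase)),
        enabled_(std::getenv("MTA_TIMING") != nullptr),
        start_(std::chrono::steady_clock::now())
    {
    }

    // Elapsed wall time since construction, in milliseconds.
    double elapsedMs() const
    {
        return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start_).count();
    }

    ~ScopedPhaseTimer()
    {
        if (enabled_)
        {
            std::fprintf(stderr, "[mta-timer] %s: %.3f ms\n", phase_.c_str(), elapsedMs());
        }
    }

private:
    std::string phase_;
    bool        enabled_;
    std::chrono::steady_clock::time_point start_;
};
```

Scoping an instance around each phase (e.g. the neighbour-pair augmentation) records that phase's wall time only when the environment variable is set, so the default run stays silent.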

@HaoZeke HaoZeke requested a review from PicoCentauri February 16, 2026 13:14
Development

Successfully merging this pull request may close these issues.

bug(install): fixup RPATH

2 participants